[logs] Implement Journaling Payload to Disk for Network Outages by angel-ddog · Pull Request #48143 · DataDog/datadog-agent

angel-ddog · 2026-03-20T19:35:06Z

What does this PR do?

Adds an opt-in disk retry mechanism to the logs sender. During network outages, when the HTTP/TCP destination enters its retry loop and the sender buffer fills up, payloads that would otherwise be silently dropped are now written to disk. When connectivity recovers, the payloads are replayed in FIFO order back through the normal send path.

This feature is disabled by default. Setting logs_config.disk_retry.max_size_bytes to a non-zero value enables it.

Motivation

Epic

During network slowdowns or complete outages, the logs pipeline drops payloads. The destination enters an infinite retry loop on the current payload, the DestinationSender buffer fills, and subsequent payloads are silently dropped. Customers lose log data with no recovery path. This change saves those payloads to disk and replays them when the network recovers.

Changes

New package: pkg/logs/sender/diskretry/

serialization.go: Binary payload serialization/deserialization with magic number, version header, and corruption detection
retrier.go: Retrier interface, DiskRetryManager (store, replay loop, disk capacity management, TTL expiry, startup reload), and noopRetrier for when disabled

Configuration

Key	Type	Default	Description
`logs_config.disk_retry.max_size_bytes`	int	`0`	Max disk space for retry files. `0` = disabled.
`logs_config.disk_retry.path`	string	`<run_path>/logs-retry`	Directory for retry files
`logs_config.disk_retry.max_disk_ratio`	float	`0.80`	Stop writing when filesystem usage exceeds this
`logs_config.disk_retry.file_ttl_days`	int	`7`	Remove retry files older than this

Describe how you validated your changes

Manual QA:

Validated with real Datadog intake via local TCP proxy (disk-retry-qa-real.sh): confirmed during-outage logs appear in Log Explorer after recovery

Script:

#!/bin/bash
#
# Disk Retry QA Script -- Real Datadog Intake via Local Proxy
#
# Runs a local TCP proxy (localhost:18443) that forwards to Datadog's intake.
# The agent is configured to send through this proxy. Stopping/starting the
# proxy simulates a clean network outage.
#
# Flow:
#   1. Start proxy, agent sends logs through it     -> verify in Log Explorer
#   2. Kill proxy (simulates outage)                 -> payloads go to disk
#   3. Stop log writer, let pipeline drain
#   4. Restart proxy                                 -> payloads replay to Datadog
#
# Prerequisites:
#   - Build the agent: dda inv agent.build --build-exclude=systemd
#   - Valid API key in dev/dist/datadog.yaml
#   - datadog.yaml must set: logs_config.logs_dd_url: "localhost:18443"
#   - disk_retry config enabled
#
# Usage:
#   chmod +x dev/disk-retry-qa-real.sh
#   ./dev/disk-retry-qa-real.sh

AGENT_BIN="./bin/agent/agent"
LOG_FILE="/tmp/test-disk-retry-real.log"
RETRY_DIR="/tmp/dd-logs-retry"
PROXY_PORT=18443
INTAKE_HOST="agent-http-intake.logs.datadoghq.com"
INTAKE_PORT=443
LOG_WRITER_PID=""
PROXY_PID=""

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
NC='\033[0m'

cleanup() {
    echo -e "\n${YELLOW}Cleaning up...${NC}"
    [ -n "$LOG_WRITER_PID" ] && kill "$LOG_WRITER_PID" 2>/dev/null
    [ -n "$PROXY_PID" ] && kill "$PROXY_PID" 2>/dev/null
    sleep 1
    echo "Done."
}
trap cleanup EXIT

# ── Helpers ──────────────────────────────────────────────────────────

start_proxy() {
    # TCP proxy: accepts TLS connections on localhost:18443, forwards to Datadog intake.
    # Uses socat if available, falls back to python.
    if command -v socat &>/dev/null; then
        socat TCP-LISTEN:$PROXY_PORT,fork,reuseaddr OPENSSL:$INTAKE_HOST:$INTAKE_PORT,verify=0 &
        PROXY_PID=$!
    else
        python3 -c "
import socket, ssl, threading, sys

def forward(src, dst):
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except:
        pass
    finally:
        src.close()
        dst.close()

def handle(client):
    ctx = ssl.create_default_context()
    upstream = ctx.wrap_socket(socket.socket(), server_hostname='$INTAKE_HOST')
    upstream.connect(('$INTAKE_HOST', $INTAKE_PORT))
    t1 = threading.Thread(target=forward, args=(client, upstream), daemon=True)
    t2 = threading.Thread(target=forward, args=(upstream, client), daemon=True)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', $PROXY_PORT))
server.listen(32)
print(f'TCP proxy listening on :$PROXY_PORT -> $INTAKE_HOST:$INTAKE_PORT', flush=True)
while True:
    client, addr = server.accept()
    threading.Thread(target=handle, args=(client,), daemon=True).start()
" &
        PROXY_PID=$!
    fi
    sleep 2
    echo -e "${GREEN}Proxy started on :$PROXY_PORT -> $INTAKE_HOST:$INTAKE_PORT${NC}"
}

stop_proxy() {
    if [ -n "$PROXY_PID" ]; then
        kill "$PROXY_PID" 2>/dev/null
        sleep 1
        PROXY_PID=""
    fi
}

start_log_writer() {
    local tag=$1
    (
        i=0
        while true; do
            echo "{\"message\": \"disk-retry-qa $tag line $i\", \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\", \"phase\": \"$tag\"}" >> "$LOG_FILE"
            i=$((i + 1))
            sleep 1
        done
    ) &
    LOG_WRITER_PID=$!
}

stop_log_writer() {
    if [ -n "$LOG_WRITER_PID" ]; then
        kill "$LOG_WRITER_PID" 2>/dev/null
        sleep 1
        LOG_WRITER_PID=""
    fi
}

retry_file_count() {
    if [ -d "$RETRY_DIR" ]; then
        sudo find "$RETRY_DIR" -name "*.retry" 2>/dev/null | wc -l | tr -d ' '
    else
        echo "0"
    fi
}

retry_dir_size() {
    if [ -d "$RETRY_DIR" ]; then
        sudo du -sh "$RETRY_DIR" 2>/dev/null | cut -f1
    else
        echo "0"
    fi
}

# ── Main ─────────────────────────────────────────────────────────────

echo "============================================"
echo "  Disk Retry QA -- Real Datadog via Proxy"
echo "============================================"
echo ""

# Clean state
rm -f "$LOG_FILE" 2>/dev/null
sudo rm -rf "$RETRY_DIR" 2>/dev/null
touch "$LOG_FILE"

# Check agent binary
if [ ! -x "$AGENT_BIN" ]; then
    echo -e "${RED}Agent binary not found at $AGENT_BIN${NC}"
    echo "Build it first: dda inv agent.build --build-exclude=systemd"
    exit 1
fi

# ── Phase 1: Normal operation ────────────────────────────────────────
echo -e "${GREEN}=== Phase 1: Normal operation ===${NC}"
echo "Starting proxy..."
start_proxy

echo "Starting log writer (1 line/sec, tagged phase=pre-outage)..."
start_log_writer "pre-outage"

echo ""
echo -e "${YELLOW}Start the agent in another terminal:${NC}"
echo ""
echo "  cd $(pwd)"
echo "  sudo $AGENT_BIN run -c dev/dist/datadog.yaml"
echo ""
echo "Check Datadog Log Explorer for: service:disk-retry-test"
echo ""
read -p "Press ENTER once the agent is running and you see logs in Datadog... "

echo ""
echo -e "${CYAN}Retry dir: $(retry_file_count) files, $(retry_dir_size)${NC}"

# ── Phase 2: Simulate outage (45 seconds) ───────────────────────────
echo ""
echo -e "${RED}=== Phase 2: Killing proxy (45s outage) ===${NC}"
stop_log_writer
start_log_writer "during-outage"

stop_proxy
echo "Proxy killed. Agent cannot reach Datadog."
echo "Logs tagged 'during-outage' being written. Payloads should go to disk."
echo ""

for i in $(seq 1 45); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%2ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"
    sleep 1
done

PHASE2_FINAL=$(retry_file_count)
PHASE2_SIZE=$(retry_dir_size)
echo ""
if [ "$PHASE2_FINAL" -gt 0 ] 2>/dev/null; then
    echo -e "${GREEN}Phase 2 result: $PHASE2_FINAL files, $PHASE2_SIZE on disk${NC}"
else
    echo -e "${RED}Phase 2 result: NO retry files -- check agent logs${NC}"
fi

# ── Phase 3: Stop log writer ─────────────────────────────────────────
echo ""
echo -e "${YELLOW}=== Phase 3: Stopping log writer ===${NC}"
stop_log_writer
echo "Waiting 15s for pipeline to flush to disk..."
echo ""

for i in $(seq 1 15); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%2ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"
    sleep 1
done

PRE_RECOVERY=$(retry_file_count)
PRE_RECOVERY_SIZE=$(retry_dir_size)
echo ""
echo -e "${CYAN}Files before recovery: $PRE_RECOVERY ($PRE_RECOVERY_SIZE)${NC}"

# ── Phase 4: Recovery ────────────────────────────────────────────────
echo ""
echo -e "${GREEN}=== Phase 4: Restarting proxy + new logs ===${NC}"
start_log_writer "post-recovery"
start_proxy

echo "Watching for drain (5 minutes max)..."
echo ""

LAST_COUNT=$PRE_RECOVERY
for i in $(seq 1 300); do
    count=$(retry_file_count)
    size=$(retry_dir_size)
    printf "  [%3ds] retry files: %-6s disk usage: %s\n" "$i" "$count" "$size"

    if [ "$count" = "0" ]; then
        echo ""
        echo -e "  ${GREEN}All retry files replayed at ${i}s!${NC}"
        break
    fi

    if [ "$count" -lt "$LAST_COUNT" ] 2>/dev/null; then
        echo -e "  ${CYAN}  ^ files decreasing -- replay active${NC}"
    fi
    LAST_COUNT=$count

    sleep 1
done

# ── Summary ──────────────────────────────────────────────────────────
echo ""
echo "============================================"
echo "  QA Summary"
echo "============================================"
FINAL_COUNT=$(retry_file_count)
FINAL_SIZE=$(retry_dir_size)
echo "  Phase 2 (outage):   $PHASE2_FINAL files ($PHASE2_SIZE)"
echo "  Pre-recovery:       $PRE_RECOVERY files ($PRE_RECOVERY_SIZE)"
echo "  Final:              $FINAL_COUNT files ($FINAL_SIZE)"
echo ""
if [ "$PHASE2_FINAL" -gt 0 ] 2>/dev/null && [ "$FINAL_COUNT" -lt "$PRE_RECOVERY" ] 2>/dev/null; then
    echo -e "  ${GREEN}PASS: Payloads stored and replaying to Datadog${NC}"
elif [ "$FINAL_COUNT" = "0" ]; then
    echo -e "  ${GREEN}PASS: All payloads replayed to Datadog${NC}"
else
    echo -e "  ${YELLOW}PARTIAL: Replay in progress, may need more time${NC}"
fi
echo ""
echo "Check Datadog Log Explorer for all three phases:"
echo "  service:disk-retry-test @phase:pre-outage"
echo "  service:disk-retry-test @phase:during-outage  (replayed from disk)"
echo "  service:disk-retry-test @phase:post-recovery"
echo ""
echo "You can now stop the agent."

Screenshots of Local Output (The replayed payloads appeared on the Logs Explorer as well):

This was my datadog.yaml:

Additional Notes

Replayed payloads use a minimal Origin with empty Identifier so the auditor safely skips registry updates without panicking
Files persist across agent restarts and are reloaded on startup
Replay throughput is currently bounded by the worker queue size; optimization is planned as a follow-up

agent-platform-auto-pr · 2026-03-20T19:43:55Z

Go Package Import Differences

Baseline: a58d244
Comparison: 5c0f26c

binary	os	arch	change
agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
agent	aix	ppc64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
iot-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
iot-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
heroku-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent-cloudfoundry	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
cluster-agent-cloudfoundry	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
dogstatsd	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
dogstatsd	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
process-agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
heroku-process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
security-agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
system-probe	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
otel-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
otel-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry
privateactionrunner	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/logs/sender/diskretry

agent-platform-auto-pr · 2026-03-20T19:59:02Z

Files inventory check summary

File checks results against ancestor a58d2448:

Results for datadog-agent_7.80.0~devel.git.27.5c0f26c.pipeline.108241515-1_amd64.deb:

No change detected

…gs-to-disk merged main into my branch

… with this.

agent-platform-auto-pr · 2026-03-25T17:48:01Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 6ba456e
📊 Static Quality Gates Dashboard
🔗 SQG Job
SOME SIZE DELTAS ARE N/A (ANCESTOR METRICS NOT YET AVAILABLE). RETRY JOB

Successful checks

Info

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	N/A	N/A → 753.179 → 753.380
✅	agent_deb_amd64_fips	N/A	N/A → 710.083 → 713.900
✅	agent_heroku_amd64	N/A	N/A → 313.383 → 320.580
✅	agent_msi	N/A	N/A → 605.070 → 651.440
✅	agent_rpm_amd64	N/A	N/A → 753.163 → 753.350
✅	agent_rpm_amd64_fips	N/A	N/A → 710.067 → 713.880
✅	agent_rpm_arm64	N/A	N/A → 731.580 → 735.290
✅	agent_rpm_arm64_fips	N/A	N/A → 691.497 → 696.840
✅	agent_suse_amd64	N/A	N/A → 753.163 → 753.350
✅	agent_suse_amd64_fips	N/A	N/A → 710.067 → 713.880
✅	agent_suse_arm64	N/A	N/A → 731.580 → 735.290
✅	agent_suse_arm64_fips	N/A	N/A → 691.497 → 696.840
✅	docker_agent_amd64	N/A	N/A → 813.484 → 815.700
✅	docker_agent_arm64	N/A	N/A → 816.672 → 821.970
✅	docker_agent_jmx_amd64	N/A	N/A → 1004.400 → 1006.580
✅	docker_agent_jmx_arm64	N/A	N/A → 996.366 → 1001.570
✅	docker_cluster_agent_amd64	N/A	N/A → 203.961 → 206.270
✅	docker_cluster_agent_arm64	N/A	N/A → 218.420 → 220.000
✅	docker_cws_instrumentation_amd64	N/A	N/A → 7.142 → 7.180
✅	docker_cws_instrumentation_arm64	N/A	N/A → 6.689 → 6.920
✅	docker_dogstatsd_amd64	N/A	N/A → 39.273 → 39.380
✅	docker_dogstatsd_arm64	N/A	N/A → 37.507 → 37.940
✅	dogstatsd_deb_amd64	N/A	N/A → 29.917 → 30.610
✅	dogstatsd_deb_arm64	N/A	N/A → 28.066 → 29.110
✅	dogstatsd_rpm_amd64	N/A	N/A → 29.917 → 30.610
✅	dogstatsd_suse_amd64	N/A	N/A → 29.917 → 30.610
✅	iot_agent_deb_amd64	N/A	N/A → 43.258 → 44.290
✅	iot_agent_deb_arm64	N/A	N/A → 40.309 → 41.920
✅	iot_agent_deb_armhf	N/A	N/A → 41.057 → 42.100
✅	iot_agent_rpm_amd64	N/A	N/A → 43.259 → 44.290
✅	iot_agent_suse_amd64	N/A	N/A → 43.259 → 44.290

On-wire sizes (compressed)

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	N/A	N/A → 174.836 → 178.360
✅	agent_deb_amd64_fips	N/A	N/A → 165.387 → 172.790
✅	agent_heroku_amd64	N/A	N/A → 75.025 → 79.970
✅	agent_msi	N/A	N/A → 138.457 → 146.220
✅	agent_rpm_amd64	N/A	N/A → 177.628 → 181.830
✅	agent_rpm_amd64_fips	N/A	N/A → 167.691 → 173.370
✅	agent_rpm_arm64	N/A	N/A → 159.633 → 163.060
✅	agent_rpm_arm64_fips	N/A	N/A → 151.481 → 156.170
✅	agent_suse_amd64	N/A	N/A → 177.628 → 181.830
✅	agent_suse_amd64_fips	N/A	N/A → 167.691 → 173.370
✅	agent_suse_arm64	N/A	N/A → 159.633 → 163.060
✅	agent_suse_arm64_fips	N/A	N/A → 151.481 → 156.170
✅	docker_agent_amd64	N/A	N/A → 268.248 → 272.480
✅	docker_agent_arm64	N/A	N/A → 255.437 → 261.060
✅	docker_agent_jmx_amd64	N/A	N/A → 336.903 → 341.100
✅	docker_agent_jmx_arm64	N/A	N/A → 320.078 → 325.620
✅	docker_cluster_agent_amd64	N/A	N/A → 71.375 → 72.920
✅	docker_cluster_agent_arm64	N/A	N/A → 67.013 → 68.220
✅	docker_cws_instrumentation_amd64	N/A	N/A → 2.999 → 3.330
✅	docker_cws_instrumentation_arm64	N/A	N/A → 2.729 → 3.090
✅	docker_dogstatsd_amd64	N/A	N/A → 15.186 → 15.820
✅	docker_dogstatsd_arm64	N/A	N/A → 14.501 → 14.830
✅	dogstatsd_deb_amd64	N/A	N/A → 7.901 → 8.790
✅	dogstatsd_deb_arm64	N/A	N/A → 6.788 → 7.710
✅	dogstatsd_rpm_amd64	N/A	N/A → 7.913 → 8.800
✅	dogstatsd_suse_amd64	N/A	N/A → 7.913 → 8.800
✅	iot_agent_deb_amd64	N/A	N/A → 11.399 → 13.040
✅	iot_agent_deb_arm64	N/A	N/A → 9.718 → 11.450
✅	iot_agent_deb_armhf	N/A	N/A → 9.936 → 11.620
✅	iot_agent_rpm_amd64	N/A	N/A → 11.415 → 13.060
✅	iot_agent_suse_amd64	N/A	N/A → 11.415 → 13.060

cit-pr-commenter-54b7da · 2026-03-25T18:10:55Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: eaa443d7-ce6b-44d3-be50-5608fa3bc00b

Baseline: 2a91625
Comparison: 5c0f26c
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+0.79	[-2.29, +3.88]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	otlp_ingest_logs	memory utilization	+1.67	[+1.56, +1.77]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+1.45	[+1.27, +1.63]	1	Logs
➖	docker_containers_cpu	% cpu utilization	+0.79	[-2.29, +3.88]	1	Logs
➖	quality_gate_logs	% cpu utilization	+0.78	[-0.83, +2.40]	1	Logs bounds checks dashboard
➖	tcp_syslog_to_blackhole	ingress throughput	+0.64	[+0.48, +0.79]	1	Logs
➖	ddot_logs	memory utilization	+0.55	[+0.49, +0.60]	1	Logs
➖	docker_containers_memory	memory utilization	+0.52	[+0.43, +0.61]	1	Logs
➖	quality_gate_metrics_logs	memory utilization	+0.43	[+0.19, +0.67]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_cumulative	memory utilization	+0.42	[+0.27, +0.57]	1	Logs
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	+0.18	[+0.12, +0.24]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.17	[+0.14, +0.20]	1	Logs bounds checks dashboard
➖	file_tree	memory utilization	+0.14	[+0.09, +0.20]	1	Logs
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	+0.11	[-0.11, +0.33]	1	Logs
➖	otlp_ingest_metrics	memory utilization	+0.09	[-0.06, +0.24]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.08	[-0.36, +0.52]	1	Logs
➖	ddot_metrics	memory utilization	+0.05	[-0.14, +0.25]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.05	[-0.45, +0.56]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.00	[-0.21, +0.21]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.11, +0.11]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	-0.01	[-0.21, +0.20]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.04	[-0.15, +0.08]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	-0.06	[-0.46, +0.33]	1	Logs
➖	quality_gate_idle	memory utilization	-0.37	[-0.42, -0.32]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	observed_value	links
✅	docker_containers_cpu	simple_check_run	10/10	695 ≥ 26
✅	docker_containers_memory	memory_usage	10/10	276.01MiB ≤ 370MiB
✅	docker_containers_memory	simple_check_run	10/10	682 ≥ 26
✅	file_to_blackhole_0ms_latency	memory_usage	10/10	0.19GiB ≤ 1.20GiB
✅	file_to_blackhole_0ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10	0.24GiB ≤ 1.20GiB
✅	file_to_blackhole_1000ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_100ms_latency	memory_usage	10/10	0.20GiB ≤ 1.20GiB
✅	file_to_blackhole_100ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_500ms_latency	memory_usage	10/10	0.22GiB ≤ 1.20GiB
✅	file_to_blackhole_500ms_latency	missed_bytes	10/10	0B = 0B
✅	quality_gate_idle	intake_connections	10/10	4 = 4	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	174.79MiB ≤ 181MiB	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	4 = 4	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	499.54MiB ≤ 550MiB	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	207.54MiB ≤ 220MiB	bounds checks dashboard
✅	quality_gate_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	345.99 ≤ 2000	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	419.21MiB ≤ 475MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.

…gs-to-disk merging into my branch

…pping

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd249d52a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

s-alad

lgtm for agent-config file

… added message count bound to prevent OOM

…gs-to-disk merging with main

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 323a7e90ef

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

When a payload is stored to disk during an outage, notify the auditor so the tailer advances past those offsets and won't re-read them on restart. On shutdown, drain any payloads remaining in queue channels back to disk so they are not lost between enqueue and worker processing.

…lag in serialization

andrewqian2001datadog

LGTM

angel-ddog

Logs-agent architecture review

The diff introduces a disk-backed retry path inside the sender subsystem, but it changes the durable-progress contract in ways that conflict with the selected auditor and restart invariants. The main architectural risks are premature auditor advancement on local spool, replayed payloads no longer carrying enough origin identity to advance the auditor after actual delivery, and shutdown/restart sequencing that can strand in-flight payloads outside both the auditor and disk retry store.

Inline findings posted: 3

angel-ddog · 2026-04-24T03:02:33Z

+					// Try to save to disk instead of blocking the pipeline.
+					if err := s.retrier.Store(payload); err == nil {
+						// Update the auditor so the tailer advances past these
+						// offsets and won't re-read them on agent restart.


Worker advances the auditor when a payload is only spooled locally, not delivered to any reliable destination

When all reliable destinations are unavailable, the new path calls s.retrier.Store(payload) and then immediately pushes the same payload to reliableOutputChan, marking it as acknowledged for the auditor before any reliable destination has accepted it. The selected invariants require auditor progress to advance only after successful delivery to a reliable destination, and explicitly warn against treating buffered/intermediate storage as durable delivery. Local disk retry is an internal sender buffer, not a reliable destination acknowledgment. If the retry file later expires, is dropped for capacity, becomes unreadable, or replay never succeeds, the auditor may already have advanced past data that was never delivered upstream.

Context: invariants/auditor-delivery.md, invariants/sender-destination-semantics.md, architecture/pipeline-flow.md
Confidence: 0.98

angel-ddog · 2026-04-24T03:02:33Z

+}
+
+// diskRetrySource is a shared LogSource used for deserialized payloads.
+// It provides the minimum non-nil structure needed so that deserialized payloads


Replayed payloads intentionally lose origin identity, so successful replay cannot update durable offsets anymore

DeserializePayload reconstructs MessageMetadata with a synthetic shared source and an empty Origin.Identifier, and the code comment states that the auditor will skip registry updates for such payloads. Combined with the new worker behavior, this means the only auditor advancement for disk-spooled payloads happens at spool time, not at actual destination success time. That breaks the sender/auditor boundary described in the selected pages: after restart, replayed payloads can be delivered successfully without any corresponding real auditor acknowledgment path tied to the original origin metadata. Architecturally this turns disk retry into a parallel persistence ledger outside the auditor, which the restart invariants do not recognize as the source of truth for durable progress.

Context: invariants/auditor-delivery.md, components/auditor.md, architecture/logs-agent-overview.md
Confidence: 0.95

angel-ddog · 2026-04-24T03:02:33Z

 }

-// Stop stops all sender workers
+// Stop stops all sender workers and the disk retry replay loop.


Shutdown drains only queued payloads to disk and can drop payloads already dequeued by workers but not yet spooled or acknowledged

Sender.Stop() now stops the replay loop, then stops workers, then drains channel contents to disk. But workers remove payloads from queue channels before entering the reliable-send / disk-store loop. Any payload already dequeued into a worker when shutdown begins is not part of the post-stop queue drain, and there is no handoff ensuring it is either delivered, written to disk, or reflected in the auditor before the worker exits. The restart invariants require transient delivery components to stop cleanly before auditor flush, without dropped in-flight state. This stop ordering creates a concrete stranded-state window for in-flight payloads that are no longer in queues and not yet persisted anywhere durable.

Context: invariants/graceful-restart.md, components/restart-lifecycle.md, invariants/auditor-delivery.md
Confidence: 0.87

Implement Journaling Payload to Disk

d5942dc

github-actions Bot added the long review PR is complex, plan time to review it label Mar 20, 2026

dd-octo-sts Bot added internal Identify a non-fork PR team/agent-log-pipelines team/agent-configuration labels Mar 20, 2026

angel-ddog added this to the 7.79.0 milestone Mar 20, 2026

angel-ddog added 3 commits March 24, 2026 15:53

Fixed linter issues

6f461bc

Merge branch 'main' of github.com:DataDog/datadog-agent into angel/lo…

9895dbc

…gs-to-disk merged main into my branch

copyright linter error, and go mod lint fixed. Including reno release…

569c354

… with this.

angel-ddog added the qa/done QA done before merge and regressions are covered by tests label Mar 25, 2026

angel-ddog added 2 commits March 25, 2026 15:39

Merge branch 'main' of github.com:DataDog/datadog-agent into angel/lo…

ab419b4

…gs-to-disk merging into my branch

Bug Fix: Retried payloads now replay for dual shipping and single shi…

4fd249d

…pping

angel-ddog changed the title ~~Implement Journaling Payload to Disk~~ [logs] Implement Journaling Payload to Disk for Network Outages Mar 25, 2026

angel-ddog marked this pull request as ready for review March 25, 2026 20:23

angel-ddog requested review from a team as code owners March 25, 2026 20:23

angel-ddog requested a review from s-alad March 25, 2026 20:23

chatgpt-codex-connector Bot reviewed Mar 25, 2026

View reviewed changes

Comment thread pkg/logs/sender/worker.go

Comment thread pkg/logs/sender/diskretry/retrier.go

Comment thread pkg/logs/sender/diskretry/serialization.go

s-alad approved these changes Mar 26, 2026

View reviewed changes

angel-ddog added 2 commits March 27, 2026 14:29

Fixed worker duplicate payload, added TTL on reload of live payloads,…

3fa04a4

… added message count bound to prevent OOM

Merge branch 'main' of github.com:DataDog/datadog-agent into angel/lo…

323a7e9

…gs-to-disk merging with main

angel-ddog marked this pull request as draft April 2, 2026 21:31

angel-ddog marked this pull request as ready for review April 2, 2026 21:31

chatgpt-codex-connector Bot reviewed Apr 3, 2026

View reviewed changes

Comment thread pkg/logs/sender/worker.go Outdated

Comment thread pkg/logs/sender/sender.go

andrewqian2001datadog reviewed Apr 3, 2026

View reviewed changes

Comment thread pkg/logs/sender/diskretry/serialization.go

ddrthall reviewed Apr 9, 2026

View reviewed changes

Comment thread pkg/logs/sender/sender.go

angel-ddog added 2 commits April 13, 2026 14:08

fix(logs): prevent multi-reliable duplicate replay and preserve MRF f…

891bd21

…lag in serialization

Merge branch 'main' into angel/logs-to-disk

aa15425

andrewqian2001datadog approved these changes Apr 13, 2026

View reviewed changes

Merge branch 'main' into angel/logs-to-disk

5c0f26c

This comment was marked as off-topic.

Sign in to view

angel-ddog commented Apr 24, 2026

View reviewed changes

Conversation

angel-ddog commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Motivation

Changes

Configuration

Describe how you validated your changes

Additional Notes

Uh oh!

agent-platform-auto-pr Bot commented Mar 20, 2026 • edited by dd-octo-sts Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Go Package Import Differences

Uh oh!

agent-platform-auto-pr Bot commented Mar 20, 2026 • edited by dd-octo-sts Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Files inventory check summary

Results for datadog-agent_7.80.0~devel.git.27.5c0f26c.pipeline.108241515-1_amd64.deb:

Uh oh!

agent-platform-auto-pr Bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Static quality checks

Info

Uh oh!

cit-pr-commenter-54b7da Bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression Detector

Regression Detector Results

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Fine details of change detection per experiment

Bounds Checks: ✅ Passed

Explanation

CI Pass/Fail Decision

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

s-alad left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andrewqian2001datadog left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as off-topic.

Uh oh!

angel-ddog left a comment

Choose a reason for hiding this comment

Uh oh!

angel-ddog Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

angel-ddog Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

angel-ddog Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

angel-ddog commented Mar 20, 2026 •

edited

Loading

agent-platform-auto-pr Bot commented Mar 20, 2026 •

edited by dd-octo-sts Bot

Loading

agent-platform-auto-pr Bot commented Mar 20, 2026 •

edited by dd-octo-sts Bot

Loading

agent-platform-auto-pr Bot commented Mar 25, 2026 •

edited

Loading

cit-pr-commenter-54b7da Bot commented Mar 25, 2026 •

edited

Loading